23 research outputs found
Optimizing Dynamic Time Warping’s Window Width for Time Series Data Mining Applications
Dynamic Time Warping (DTW) is a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping (w). In the supervised case with ample data, w is typically set by cross-validation in the training stage. However, this method is likely to yield suboptimal results for small training sets. For the unsupervised case, learning via cross-validation is not possible because we do not have access to labeled data. Many practitioners have thus resorted to assuming that “the larger the better”, and they use the largest value of w permitted by the computational resources. However, as we will show, in most circumstances this is a naïve approach that produces inferior clusterings. Moreover, the best warping window width is generally not transferable between the two tasks; i.e., for a single dataset, practitioners cannot simply apply the best w learned for classification to clustering, or vice versa. In addition, we will demonstrate that the appropriate amount of warping depends not only on the data structure but also on the dataset size. Thus, even if a practitioner knows the best setting for a given dataset, they will likely be at a loss if they apply that setting to a larger version of that data. All these issues seem largely unknown, or at least unappreciated, in the community. In this work, we demonstrate the importance of setting DTW’s warping window width correctly, and we also propose novel methods to learn this parameter in both supervised and unsupervised settings. The algorithms we propose to learn w can produce significant improvements in classification accuracy and clustering quality. We demonstrate the correctness of our novel observations and the utility of our ideas by testing them with more than one hundred publicly available datasets.
Our strong results allow us to make a perhaps unexpected claim: an underappreciated “low-hanging fruit” in optimizing DTW’s performance can produce improvements that make it an even stronger baseline, closing most or all of the improvement gap to the more sophisticated methods proposed in recent years.
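To make the role of w concrete, here is a minimal sketch of DTW constrained by a warping window, together with the conventional baseline the abstract mentions: choosing w by leave-one-out cross-validation with 1-nearest-neighbor classification. It is an illustration only and is not the learning method proposed in the paper, which the abstract does not describe. The function names, the squared-difference local cost, and the tie-breaking rule (prefer the smaller w) are assumptions made for this example.

```python
import math


def dtw_distance(a, b, w):
    """DTW distance between sequences a and b under a warping window w.

    Assumption: w is the maximum allowed |i - j| between aligned indices,
    so w = 0 reduces to Euclidean distance for equal-length series.
    """
    n, m = len(a), len(b)
    w = max(w, abs(n - m))  # the band must be wide enough to reach the corner
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band around the diagonal are filled in.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def loocv_best_window(series, labels, candidate_windows):
    """Return (w, accuracy) with the best leave-one-out 1-NN accuracy.

    Assumption: ties are broken in favor of the smaller window.
    """
    best_w, best_acc = None, -1.0
    for w in sorted(candidate_windows):
        correct = 0
        for i, query in enumerate(series):
            # Nearest neighbor among all other training series.
            nearest = min(
                (k for k in range(len(series)) if k != i),
                key=lambda k: dtw_distance(query, series[k], w),
            )
            correct += labels[nearest] == labels[i]
        acc = correct / len(series)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

With a = [0, 0, 1, 0, 0] and b = [0, 1, 0, 0, 0], a window of 0 compares the peaks point by point and misses the shift, while a window of 1 lets the two peaks align. This is the flexibility that w controls, and the reason its value matters.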
The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code
Class based Influence Functions for Error Detection
Influence functions (IFs) are a powerful tool for detecting anomalous
examples in large scale datasets. However, they are unstable when applied to
deep networks. In this paper, we provide an explanation for the instability of
IFs and develop a solution to this problem. We show that IFs are unreliable
when the two data points belong to two different classes. Our solution
leverages class information to improve the stability of IFs. Extensive
experiments show that our modification significantly improves the performance
and stability of IFs while incurring no additional computational cost.
Comment: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first
authors of this paper. 12 pages, 12 figures. Accepted to ACL 202
The UCR Time Series Archive
The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since that time it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a fraction might be misattributing the reasons for their improvement. Moreover, the improvements claimed by these papers might have been achievable with a much simpler modification, requiring just a few lines of code.
Merosesquiterpenes from marine sponge Smenospongia cerebriformis
Using various chromatography methods, three merosesquiterpenes of the sesquiterpene quinone type, neodactyloquinone (1), dactyloquinone D (2), and dactyloquinone C (3), together with two indole derivatives, indole-3-aldehyde (4) and indole-3-carboxylic methyl ester (5), were isolated from the methanol extract of the Vietnamese marine sponge Smenospongia cerebriformis. Their structures were determined by 1D- and 2D-NMR spectra, HR-ESI-MS, and comparison with data reported in the literature. Keywords: Smenospongia cerebriformis, merosesquiterpene, sesquiterpene quinone, indole derivative
Towards More Accurate Time Series Data Mining by Constraining Model's Flexibility
This dissertation is motivated by the goal of enabling large-scale time series data mining tasks to produce results that are more accurate, more reproducible, and, where desired, tailored to the user’s specific needs. To that end, we have explored and contributed to the literature in three parts; each touches an active area of research, and all are unified under a common theme: reducing errors in time series data mining by learning constraints on the model’s flexibility. The first body of work concerns Dynamic Time Warping (DTW), a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping. This parameter gives DTW the flexibility to deal with data that can be locally out of phase; however, the DTW algorithm sometimes exploits this flexibility to give pathological and unwanted results. We demonstrate the importance of setting DTW’s warping window width correctly, to constrain this flexibility, and we propose novel methods to learn this parameter in both supervised and unsupervised settings. The second body of work concerns time series motif discovery, perhaps the most used primitive for time series data mining. We point out that the current definitions of motif discovery are limited and can create a mismatch between the user’s intent or expectations and the motif discovery search outcomes. We explain the reasons behind these issues and introduce a novel and general framework to address them. The last body of work concerns making more time series data sets and baseline results publicly available for gauging progress and comparing rival approaches, in the spirit of reproducible research. We work on expanding the UCR Time Series Archive, an important resource in the time series data mining community, from 85 data sets in the Fall 2015 release to 128 data sets in Fall 2018.
Creating benchmark results for this archive required 61,041,100,000,000 DTW comparisons, far more than the number of DTW comparisons that have appeared in all research papers combined. Beyond expanding this valuable resource, we offer pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive.